Untitled

Networks

A-list festivals are circles; B-list festivals are crosses.

*Big force-directed network

Leaving out the few unconnected festivals from this one for better layout. Zoomable; hover to see labels, click on nodes to see connections.

*As a matrix

Color shows the number of shared films (edge strength). This is actually more informative than a network plot, as it easily shows both degree and strength for all pairs of festivals/nodes, while a network plot

  • hides information due to overlaps, such as long-distance links
  • can only effectively show degree (strength as edge weight is lost in the mess)
  • can create incidental visual artefacts that can mislead interpretation due to the semi-random layout.

So tbh we should probably use a matrix instead, but I’m aware that networks look cooler, and that some of us have an interest in having a network-person profile. So maybe a network and a matrix side by side; would that be useful? Interactive networks are of course a bit better, as you can zoom into the mess (try above).
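A minimal sketch of how such a matrix can be built from film*fest pairs, and how degree and strength fall out of it. The film and festival names are made up for illustration, and the function name is hypothetical; this is not the code behind the plots.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical film*fest pairs; the real data has tens of thousands of rows.
pairs = [
    ("film1", "FestA"), ("film1", "FestB"),
    ("film2", "FestA"), ("film2", "FestB"), ("film2", "FestC"),
    ("film3", "FestC"),
]

def shared_film_matrix(pairs):
    """Return (festivals, matrix), where matrix[i][j] counts films shared by i and j."""
    fests_by_film = defaultdict(set)
    for film, fest in pairs:
        fests_by_film[film].add(fest)
    festivals = sorted({fest for _, fest in pairs})
    index = {fest: i for i, fest in enumerate(festivals)}
    n = len(festivals)
    matrix = [[0] * n for _ in range(n)]
    # Every film adds 1 to each pair of festivals that screened it.
    for fests in fests_by_film.values():
        for a, b in combinations(sorted(fests), 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return festivals, matrix

festivals, matrix = shared_film_matrix(pairs)
degree = [sum(1 for v in row if v > 0) for row in matrix]  # number of linked festivals
strength = [sum(row) for row in matrix]                    # total number of shared films

for fest, row in zip(festivals, matrix):
    print(fest, row)
```

Each cell holds the edge strength for one pair, so a heatmap of this matrix shows strength directly, and a row's count of non-zero cells is that festival's degree.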

Temporally arranged network

Interactive nodes yearwise, with hover labels:

Probably not that useful:

Network stats

Degree

Keep in mind that degree stats should only be interpreted for the “middle” of the database in terms of years: festivals in the earliest and latest years have lower degree simply because most of what they would link to in the past/future falls outside the database. The early years of the database also contain considerably fewer festivals.

*Raw values

(correlates moderately with festival entry size)

Normalized degree

Degree normalized by dividing by the log10 of the number of unique films in a festival entry; roughly on a similar scale to the raw values, but almost decorrelated from the entry size and therefore comparable between festivals, and between iterations of a festival series.
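As a small illustration of the normalization (the function name and the example numbers are made up):

```python
import math

def normalized_degree(degree, n_unique_films):
    """Degree divided by log10 of the number of unique films in the festival entry."""
    if n_unique_films <= 1:
        # log10(1) is 0 and log10 of smaller values is undefined or negative.
        raise ValueError("n_unique_films must be greater than 1")
    return degree / math.log10(n_unique_films)

# Two hypothetical festivals with the same raw degree but different entry sizes.
small_entry = normalized_degree(20, 100)    # divides by log10(100)
large_entry = normalized_degree(20, 1000)   # divides by log10(1000)
print(small_entry, large_entry)
```

The log keeps the divisor small, which is why the normalized values stay roughly on the scale of the raw ones while the larger entry is still penalized a little.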

Content types (“genres” in the database)

25833 unique films, 576 unique fests, 35118 film*fest pairs, after final filter. Latent space of content/genre tag similarities is inferred directly from the data, based on content type co-occurrence statistics.
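To give an idea of what "co-occurrence statistics" can look like, here is a rough sketch that counts how often tags appear on the same film and turns the counts into positive pointwise mutual information (PPMI). PPMI is my assumption for illustration only; the statistic actually used for the latent space is not spelled out in this section, and the tags and films below are made up.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical films, each with a set of content-type tags.
films = [
    {"drama", "romance"},
    {"drama", "romance"},
    {"drama", "documentary"},
    {"documentary", "short"},
]

def cooccurrence_counts(films):
    """Count how often each unordered pair of tags appears on the same film."""
    counts = Counter()
    for tags in films:
        for a, b in combinations(sorted(tags), 2):
            counts[(a, b)] += 1
    return counts

def ppmi(films):
    """Positive pointwise mutual information for every co-occurring tag pair."""
    n = len(films)
    tag_freq = Counter(tag for tags in films for tag in tags)
    scores = {}
    for (a, b), c in cooccurrence_counts(films).items():
        pmi = math.log2((c / n) / ((tag_freq[a] / n) * (tag_freq[b] / n)))
        scores[(a, b)] = max(pmi, 0.0)  # negative associations are clipped to 0
    return scores

counts = cooccurrence_counts(films)
scores = ppmi(films)
for pair, score in sorted(scores.items()):
    print(pair, counts[pair], round(score, 3))
```

Tags that co-occur more often than their individual frequencies would suggest get a high score, and those scores (or vectors built from them) could then feed a similarity space.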

Big UMAP

Possible alternative: if drama is main, add blue ring around point, but show secondary as main color.

Example count plots for explanation

(will be something nicer soon)

*Diversity

The diversity values are scaled into the range [0,1], using the largest possible distance in the latent genre space as the scaling value (and multiplied by 2 in the case of internal diversity, as it is a comparison to the internal mean), and are thus interpretable:

  • 0 internal diversity: all films are of the same metadata type
  • 1 internal: the films are as different as can be
  • 0 external: festival looks exactly like the ecosystem grand mean
  • 1 external: festival is as distant from the grand mean as possible in the given latent space.
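The scaling can be sketched as follows. This assumes films are represented as vectors in the latent space and that distance is Euclidean; both are simplifications for illustration, and the function names and toy coordinates are made up.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def internal_diversity(film_vectors, max_dist):
    """Mean distance of films to the festival mean, times 2, scaled by max_dist."""
    center = mean_vector(film_vectors)
    mean_dist = sum(math.dist(v, center) for v in film_vectors) / len(film_vectors)
    return 2 * mean_dist / max_dist

def external_diversity(film_vectors, grand_mean, max_dist):
    """Distance of the festival mean from the ecosystem grand mean, scaled by max_dist."""
    return math.dist(mean_vector(film_vectors), grand_mean) / max_dist

# Toy space where the two most distant points are (0, 0) and (1, 0), so max_dist = 1.
MAX_DIST = 1.0
same_type = [(0.0, 0.0), (0.0, 0.0)]
opposites = [(0.0, 0.0), (1.0, 0.0)]
print(internal_diversity(same_type, MAX_DIST))   # identical films
print(internal_diversity(opposites, MAX_DIST))   # films at the two extremes
print(external_diversity(opposites, (0.5, 0.0), MAX_DIST))
```

The factor of 2 is what lets internal diversity reach 1: two films at opposite extremes are each only half the maximum distance away from their own mean.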

Also, note the error bars: they represent 95% confidence intervals from nonparametric bootstrapping, i.e. the true mean value is likely to be within the given range around the estimate (the dot). Smaller festival samples, and festivals with more variation around the mean for a given statistic, therefore have wider bars. While not quite equivalent to a pairwise significance test, as a rule of thumb, if the intervals of two points of interest overlap by more than just a bit (intervals are more conservative than pairwise tests), then the difference is likely not statistically significant.
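For reference, a percentile bootstrap interval for a mean can be sketched like this. The number of resamples, the seed and the example values are arbitrary choices of mine, not the settings used for the plots.

```python
import random
import statistics

def bootstrap_ci(values, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, n_boot times, and keep each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical per-film values for a small and a large festival.
values_small = [0.2, 0.4, 0.5, 0.7, 0.9]
values_large = values_small * 20
ci_small = bootstrap_ci(values_small)
ci_large = bootstrap_ci(values_large)
print(ci_small, ci_large)
```

The larger sample gets the narrower interval even though the values have the same spread, which is the pattern to expect in the bars.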

*Change over time

As the difference between consecutive years, in festivals with at least 11 iterations. Confidence interval interpretation is the same as above.
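A sketch of the year-to-year differencing with the 11-iteration cutoff. Skipping pairs of years that are not adjacent is my assumption about how gaps should be handled; the function name and the numbers are made up.

```python
def yearly_changes(series, min_iterations=11):
    """Differences between consecutive years, or None if the series is too short.

    `series` maps year -> diversity value for one festival.
    """
    if len(series) < min_iterations:
        return None
    years = sorted(series)
    return {
        later: series[later] - series[earlier]
        for earlier, later in zip(years, years[1:])
        if later - earlier == 1  # assumption: ignore gaps between iterations
    }

# Hypothetical festivals: one with 11 iterations, one with only 2.
long_series = {2000 + i: 0.30 + 0.01 * i for i in range(11)}
short_series = {2018: 0.4, 2019: 0.5}
changes = yearly_changes(long_series)
print(changes)
print(yearly_changes(short_series))
```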



*Festival profiles by film age

Shows mean difference between event year and production years of shown films. Unit on both axes is years of difference. Some noise added on x-axis due to lots of overlapping values from the integer calculations. Zoom in to see where most of the datapoints are. High sd means festival shows a mix of newer and older films.

NB: here the bars are standard deviation, not confidence intervals. Note how Cannes seesaws between showing newer and older (retrospective?) films. Some show films from the future (final production year is later than event).
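The two numbers plotted per festival can be sketched as below. Using the sample standard deviation is an assumption on my part, and the years are invented.

```python
import statistics

def film_age_profile(event_year, production_years):
    """Mean and standard deviation of (event year - production year)."""
    ages = [event_year - year for year in production_years]
    sd = statistics.stdev(ages) if len(ages) > 1 else 0.0
    return statistics.fmean(ages), sd

# Hypothetical festival edition: mostly new films, one retrospective title,
# and one film whose final production year is later than the event.
event_year = 2015
production_years = [2014, 2014, 2015, 1965, 2016]
mean_age, sd_age = film_age_profile(event_year, production_years)
print(mean_age, sd_age)
```

A single old title pulls both the mean and the sd up, and a production year later than the event gives a negative age, i.e. a film "from the future".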


Comparison with genre internal diversity


*Languages

24727 unique films, 547 unique fests, 43187 film*fest pairs, after final filter.

*Big UMAP

*Diversity

Again, the diversity values are scaled into a range of [0,1], using the largest possible distance in the latent space (the interlanguage distances) as the scaling value, and are thus interpretable:

  • 0 internal diverity: all films are of the same metadata type
  • 1 internal: the films are as different as can be
  • 0 external: festival looks exactly like the ecosystem grand mean
  • 1 external: festival is as distant from the grand mean as possible in the given latent space.



Geography

27063 unique films, 584 unique fests, 53137 film*fest pairs, after final filter.

Note: since production location is at the precision of country, I’m using the coordinates of capital cities for all calculations here, with the exception of the US, where Los Angeles coordinates are used.

*Diversity

All in kilometers - these could in principle also be scaled (by the circumference of the globe divided by 2), but are probably more informative as real-valued statistics.
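For illustration, great-circle distance in kilometers and the optional scaling by half the circumference can be sketched with the haversine formula. Whether haversine is the formula used here is an assumption, as is the mean Earth radius constant.

```python
import math

EARTH_RADIUS_KM = 6371.0                  # mean Earth radius (assumed constant)
MAX_DIST_KM = math.pi * EARTH_RADIUS_KM   # half the circumference

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def scaled_distance(lat1, lon1, lat2, lon2):
    """Same distance mapped to [0, 1], where 1 is the opposite side of the globe."""
    return haversine_km(lat1, lon1, lat2, lon2) / MAX_DIST_KM

quarter = haversine_km(0.0, 0.0, 0.0, 90.0)    # a quarter of the way around the equator
antipode = scaled_distance(0.0, 0.0, 0.0, 180.0)
print(quarter, antipode)
```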

*Latent centers of festivals

Each dot is the average production-country coordinates of the films of one festival (weighted, if coproduction). Latent centre is the big + , Cannes is the black dot. I removed the geographical borders/continents, as they may invite the false interpretation that festivals take place in the sea, but added some geo labels to ease orientation. Obviously no dimension reduction is needed, as the space is already plottable on a 2-dimensional projection (by longitude & latitude), albeit technically being a sphere for calculation purposes.
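One way to average coordinates on a sphere is to convert each point to a 3D unit vector, take the weighted mean, and convert back. This is a sketch of that idea under my own assumptions (including how coproduction weights are split); it is not necessarily how the centres in the plot were computed.

```python
import math

def spherical_centroid(points):
    """Weighted mean of (lat, lon, weight) points, computed via 3D unit vectors."""
    x = y = z = 0.0
    for lat, lon, weight in points:
        phi, lam = math.radians(lat), math.radians(lon)
        x += weight * math.cos(phi) * math.cos(lam)
        y += weight * math.cos(phi) * math.sin(lam)
        z += weight * math.sin(phi)
    lon_c = math.degrees(math.atan2(y, x))
    lat_c = math.degrees(math.atan2(z, math.hypot(x, y)))
    return lat_c, lon_c

# Hypothetical festival: one film from a country at (0, 0) and one coproduction
# split equally (weight 0.5 each, an assumption) between (0, 0) and (0, 90).
two_countries = [(0.0, 0.0, 1.0), (0.0, 90.0, 1.0)]
with_coproduction = [(0.0, 0.0, 1.0), (0.0, 0.0, 0.5), (0.0, 90.0, 0.5)]
print(spherical_centroid(two_countries))
print(spherical_centroid(with_coproduction))
```

Unlike a plain mean of latitudes and longitudes, this does not break when points sit on either side of the 180° meridian, though the result is undefined for perfectly antipodal points.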

For reference, the actual host countries of the festivals: since these overlap a lot, they are scattered a bit (but internally still located at capital city coordinates, regardless of actual host city).

Distance between event host country and its latent (film-based) geographical centre

In kilometers; “eventdist” on the tooltip.

*Comparison of language and production country (internal) diversity

Interestingly, the two correlate only weakly (R² = 0.096, p < 0.0001).





Example questions

*What predicts linkage/flow between festivals

Simple binomial logistic regression model predicting linkage (at least 1 film shared between a pair of festivals) by the similarity of their metadata (no control for festival series here yet). In short, all the predictors are significant (likely non-random).

  • Event time distance is more of a control, as we know films circulate for a few years usually (the negative \(\beta\) coefficient estimate means that the more the two festivals are apart in time, the less likely they are to share films - makes sense).
  • The baseline of the categorical class variable is that the pair is of different types, so if both are A, they are more likely to share a film; if both are B, they are less likely to share (note the imbalance though: there are more B fests, of all sorts).
  • The coefficients of all the other distances are also negative: e.g. the more the mean content/genre vectors differ (higher content distance), the less likely they are to share films (again makes sense).
  • The only exception is event location distance: here, festivals are more likely to share films if they are further away from each other. I suppose filmmakers want to travel the world with their films, and not stick to the same region? But this effect may also vary between regions themselves, or content types, or who knows what else (which could be further modelled and measured of course).

Given suitable hypotheses, we could also test interactions. Importantly, this modelling is all based on festival means, and individual films are not taken into account. It could also be done the other way around, modelling the likelihood that a pair of films share a festival (instead of a pair of festivals sharing a film), which might make more sense.

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5794  -0.2952  -0.0879  -0.0144   5.6632  

Coefficients:
              Estimate Std. Error  z value Pr(>|z|)    
(Intercept)    1.03888    0.03848   26.995  < 2e-16 ***
timedist      -1.63251    0.01564 -104.401  < 2e-16 ***
class_bothAA   0.13604    0.03845    3.538 0.000403 ***
class_bothBB  -0.15953    0.02515   -6.342 2.26e-10 ***
genre         -5.90956    0.18649  -31.689  < 2e-16 ***
lang          -4.94911    0.22468  -22.027  < 2e-16 ***
geomean       -1.16421    0.23381   -4.979 6.38e-07 ***
eventloc       1.17418    0.14601    8.042 8.84e-16 ***
eventfilmdist -2.88983    0.30467   -9.485  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 77718  on 146610  degrees of freedom
Residual deviance: 48931  on 146602  degrees of freedom
AIC: 48949
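To make the estimates a bit easier to read, the log-odds coefficients can be exponentiated into odds ratios, and the two deviances give a rough share of deviance explained. A small sketch using the numbers from the summary above; note that each odds ratio is per one unit of its predictor, and the predictor scales are not stated here, so the ratios are not directly comparable with each other.

```python
import math

# Coefficient estimates copied from the model summary above.
coefs = {
    "timedist": -1.63251,
    "class_bothAA": 0.13604,
    "class_bothBB": -0.15953,
    "genre": -5.90956,
    "lang": -4.94911,
    "geomean": -1.16421,
    "eventloc": 1.17418,
    "eventfilmdist": -2.88983,
}

# exp(beta) is the multiplicative change in the odds of linkage per unit increase.
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

# Share of the null deviance removed by the predictors (a rough fit measure).
null_deviance, residual_deviance = 77718, 48931
deviance_explained = 1 - residual_deviance / null_deviance

for name, ratio in odds_ratios.items():
    print(f"{name:14s} odds ratio = {ratio:.3f}")
print(f"share of null deviance explained = {deviance_explained:.3f}")
```

Ratios below 1 mean the odds of sharing a film shrink as the predictor grows; eventloc is the only distance with a ratio above 1.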

*What predicts number of shared films

Or strength of edges, for festivals that share at least 2 films. The dependent variable is logged, but its distribution is strongly power-law-like, so linear model assumptions are not exactly met, i.e. this is a bit wonky. Interpretation is similar to the above: a negative coefficient estimate means a negative association with the number of shared films, and a positive estimate a positive one.

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept)    1.51365    0.02393  63.252 < 0.0000000000000002 ***
timedist      -0.02539    0.01281  -1.983               0.0475 *  
class_bothAA   0.16847    0.02430   6.933     0.00000000000466 ***
class_bothBB  -0.08251    0.01741  -4.739     0.00000220276470 ***
genre         -1.46203    0.14899  -9.813 < 0.0000000000000002 ***
lang          -1.35904    0.16252  -8.362 < 0.0000000000000002 ***
geomean       -0.17614    0.17612  -1.000               0.3173    
eventloc      -0.18121    0.10436  -1.736               0.0825 .  
eventfilmdist -0.52085    0.22052  -2.362               0.0182 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5569 on 4944 degrees of freedom
Multiple R-squared:  0.07302,   Adjusted R-squared:  0.07152 
F-statistic: 48.68 on 8 and 4944 DF,  p-value: < 0.00000000000000022

Compare all festival metrics + Is Sundance more like A or B?




Appendix

*Genre (content type) latent space, UMAP dimension reduction.

*Language latent space, UMAP

(this map is hard to project in 2D well, because many languages are very far from other languages due to no relation, having only incidental similarities or geographic proximity)

Plots for paper